This report analyzes the City Lifestyle Segmentation dataset to:

1. Identify lifestyle-based clusters of world cities using PCA and clustering.
2. Compare developed vs. developing regions.
3. Explore the relationship between economic and environmental factors.
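library(ggplot2); library(dplyr); library(ggcorrplot)  # packages used below; the setup chunk is not shown in this output
library(plotly); library(rnaturalearth); library(sf)
data_path <- "../data/city_lifestyle.csv"
city <- readr::read_csv(data_path)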
## Rows: 300 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): city_name, country
## dbl (8): population_density, avg_income, internet_penetration, avg_rent, air...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Quick summary
summary(city)
## city_name country population_density avg_income
## Length:300 Length:300 Min. : 100 Min. : 480
## Class :character Class :character 1st Qu.: 1830 1st Qu.:1908
## Mode :character Mode :character Median : 3084 Median :2810
## Mean : 3945 Mean :2827
## 3rd Qu.: 4824 3rd Qu.:3752
## Max. :14427 Max. :5720
## internet_penetration avg_rent air_quality_index public_transport_score
## Min. : 34.00 Min. : 170 Min. : 22.00 Min. :15.00
## 1st Qu.: 64.40 1st Qu.: 640 1st Qu.: 54.00 1st Qu.:46.08
## Median : 75.00 Median : 990 Median : 67.50 Median :54.70
## Mean : 74.31 Mean :1003 Mean : 71.25 Mean :55.72
## 3rd Qu.: 87.22 3rd Qu.:1332 3rd Qu.: 86.00 3rd Qu.:64.20
## Max. :100.00 Max. :2430 Max. :146.00 Max. :95.00
## happiness_score green_space_ratio
## Min. :2.500 Min. : 2.00
## 1st Qu.:5.300 1st Qu.:28.23
## Median :6.900 Median :34.70
## Mean :6.644 Mean :33.99
## 3rd Qu.:8.500 3rd Qu.:40.40
## Max. :8.500 Max. :58.00
cat("Number of cities:", nrow(city), "\n")
## Number of cities: 300
cat("Number of features:", ncol(city), "\n")
## Number of features: 10
missing_summary <- colSums(is.na(city))
missing_summary[missing_summary > 0]
## named numeric(0)
city_num <- city |> dplyr::select(where(is.numeric))
city_num |>
tidyr::pivot_longer(cols = everything(), names_to = "variable", values_to = "value") |>
ggplot(aes(x = value)) +
geom_histogram(bins = 20, fill = "#2C79B8", alpha = 0.7) +
facet_wrap(~ variable, scales = "free") +
theme_minimal() +
labs(title = "Distribution of Numeric City Features",
x = "Value",
y = "Count")
corr_mat <- cor(city_num, use = "pairwise.complete.obs")
ggcorrplot(corr_mat,
method = "square",
type = "lower",
lab = FALSE,
outline.color = "white",
show.legend = TRUE) +
labs(title = "Correlation Matrix of City Lifestyle Features")
Several meaningful relationships are observed:
These patterns highlight how economic development, environmental quality, and public services jointly shape the lifestyle structure of cities.
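To make the pairs behind a correlation plot explicit, the matrix can be flattened into a ranked table. The sketch below is illustrative only: it uses synthetic data (the `toy` data frame and its coefficients are assumptions, not the city dataset) and base R, and `pair_tbl` is a hypothetical name.

```r
# Toy data standing in for the numeric city features (all values are synthetic).
set.seed(1)
n <- 200
income <- rnorm(n, mean = 2800, sd = 1000)
toy <- data.frame(
  avg_income = income,
  avg_rent = 0.35 * income + rnorm(n, sd = 100),                 # built to track income
  air_quality_index = 120 - 0.015 * income + rnorm(n, sd = 15),  # built to fall with income
  green_space_ratio = rnorm(n, mean = 34, sd = 9)                # independent noise
)

corr_toy <- cor(toy, use = "pairwise.complete.obs")

# Read only the lower triangle so each pair of variables appears once.
idx <- which(lower.tri(corr_toy), arr.ind = TRUE)
pair_tbl <- data.frame(
  var1 = rownames(corr_toy)[idx[, "row"]],
  var2 = colnames(corr_toy)[idx[, "col"]],
  r = corr_toy[idx]
)

# Sort by absolute correlation, strongest first.
pair_tbl <- pair_tbl[order(-abs(pair_tbl$r)), ]
rownames(pair_tbl) <- NULL
print(pair_tbl)
```

Applied to `corr_mat`, the same steps would list the strongest relationships in the real data without relying on reading colors off the plot.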
ggplot(city, aes(x = population_density, y = happiness_score)) +
geom_point(alpha = 0.7) +
geom_smooth(method = "lm", se = FALSE, color = "red") +
theme_minimal() +
labs(title = "Happiness vs Population Density",
x = "Population density",
y = "Happiness score")
## `geom_smooth()` using formula = 'y ~ x'
- Cities with higher population density tend to report lower happiness levels on average.
- This suggests that overcrowding may reduce quality of life due to factors such as congestion, limited public space, noise, and increased stress.
- However, the considerable vertical spread of points also indicates that happiness is influenced by additional factors beyond density, such as economic resources and environmental quality.
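A fitted line on a scatter plot shows direction but not strength. As a rough sketch, the slope and R-squared of the same linear fit can be read from `lm()`. The data below are synthetic (the coefficients and noise level are assumptions chosen for illustration), so the printed numbers say nothing about the real cities.

```r
# Synthetic stand-in for the density/happiness relationship (not the city data).
set.seed(2)
n <- 300
toy <- data.frame(population_density = runif(n, 100, 14000))
toy$happiness_score <- 7.5 - 0.00015 * toy$population_density + rnorm(n, sd = 1)

# Same model that geom_smooth(method = "lm") draws: y ~ x.
fit <- lm(happiness_score ~ population_density, data = toy)
slope <- unname(coef(fit)["population_density"])
r_squared <- summary(fit)$r.squared

cat("Slope per 1,000 density units:", round(slope * 1000, 3), "\n")
cat("R-squared:", round(r_squared, 3), "\n")
```

A negative slope with a modest R-squared would match the reading above: a downward trend with much of the variation left to other factors.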
ggplot(city,
aes(x = avg_income,
y = air_quality_index,
size = green_space_ratio)) +
geom_point(alpha = 0.7, color = "#1B7F79") +
theme_minimal() +
scale_size_continuous(name = "Green space ratio") +
labs(title = "Air Quality vs Income (Bubble size = Green Space)",
x = "Average income",
y = "Air quality index")
A noticeable downward trend emerges: higher-income
cities generally exhibit lower air quality index values, indicating
better air quality (since lower index = less pollution).
In addition, larger bubbles are more frequently observed among cities with higher incomes, suggesting that wealthier cities also tend to feature more green infrastructure.
Taken together, these patterns suggest that economic resources support better environmental management, including investment in green spaces and pollution control.
# The `country` column holds region names (it is joined to `continent` later), so label it as a region.
ggplot(city, aes(x = country, y = happiness_score, fill = country)) +
geom_boxplot(alpha = 0.7) +
theme_minimal() +
labs(title = "Happiness Score by Region",
x = "Region",
y = "Happiness score") +
theme(legend.position = "none")
- Europe and North America show the highest median happiness scores, with relatively small variability, indicating consistently high quality of life in these developed regions.
- Oceania also performs strongly, though sample size appears smaller.
- Asia and South America have moderate happiness levels with wider spreads, suggesting uneven development and quality-of-life disparities within regions.
- Africa exhibits the lowest median happiness score, reflecting challenges related to economic development and social wellbeing.
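Boxplots hide how many observations sit behind each box, which matters when one group is small. A minimal sketch of a per-group summary follows, using base R and synthetic data; the region labels and group sizes are placeholders, not values from the dataset.

```r
# Synthetic groups of unequal size (labels and sizes are placeholders).
set.seed(3)
toy <- data.frame(
  region = rep(c("Region A", "Region B", "Region C"), times = c(60, 30, 8)),
  happiness_score = c(rnorm(60, 7.5, 0.5), rnorm(30, 6.0, 1.2), rnorm(8, 7.0, 0.6))
)

# One row per region: sample size, median, and interquartile range.
summary_by_region <- do.call(rbind, lapply(
  split(toy$happiness_score, toy$region),
  function(x) data.frame(n = length(x), median = median(x), iqr = IQR(x))
))

# Order regions from highest to lowest median.
summary_by_region <- summary_by_region[order(-summary_by_region$median), ]
print(round(summary_by_region, 2))
```

Reporting `n` next to each median makes it clear which regional comparisons rest on only a handful of cities.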
city_num <- city |> dplyr::select(where(is.numeric))
city_scaled <- scale(city_num)
pca_res <- prcomp(city_scaled, scale. = FALSE)
pca_var <- pca_res$sdev^2
pca_var_ratio <- pca_var / sum(pca_var)
pca_var_ratio
## [1] 0.538058042 0.258331158 0.073279421 0.054527945 0.036307051 0.024752029
## [7] 0.008892435 0.005851919
The first principal component accounts for 53.8% of the variance, while the second accounts for an additional 25.8%. Together, PC1 and PC2 explain 79.6% of the total variance.
plot(pca_var_ratio,
xlab = "Principal Component",
ylab = "Variance Explained",
main = "Scree Plot of PCA",
type = "b")
The first four components together explain about 92.4% of the total variance, so we retain them for clustering: this preserves most of the information while halving the dimensionality from 8 to 4. The plots below use only the first two or three components.
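The choice of how many components to keep can be checked against the cumulative explained variance. The sketch below reuses the ratios printed earlier in this report; the 0.90 cutoff is an assumption made for illustration, not a rule stated by the analysis.

```r
# Explained-variance ratios copied from the PCA output in this report.
pca_var_ratio <- c(0.538058042, 0.258331158, 0.073279421, 0.054527945,
                   0.036307051, 0.024752029, 0.008892435, 0.005851919)

# Running total of variance explained by the first k components.
cum_var <- cumsum(pca_var_ratio)

threshold <- 0.90  # assumed cutoff, not taken from the report
n_keep <- which(cum_var >= threshold)[1]

print(round(cum_var, 3))
cat("Components needed to reach", threshold, "of variance:", n_keep, "\n")
```

With a 0.90 cutoff this returns 4, matching the number of components retained here; a stricter cutoff of 0.95 would require 5.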
pca_scores <- as.data.frame(pca_res$x[,1:2])
colnames(pca_scores) <- c("PC1", "PC2")
ggplot(pca_scores, aes(x = PC1, y = PC2)) +
geom_point(alpha = 0.7, color = "#006699") +
theme_minimal() +
labs(title = "Cities in PCA Space",
x = "PC1", y = "PC2")
## K-Means Clustering

### Perform K-means Clustering on first 4 PCs
set.seed(42)
pca_for_cluster <- as.data.frame(pca_res$x[,1:4]) # use the first four components
k3 <- kmeans(pca_for_cluster, centers = 3, nstart = 25)
city$cluster <- factor(k3$cluster)
pca_scores$cluster <- factor(k3$cluster)
table(city$cluster)
##
## 1 2 3
## 86 142 72
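The number of clusters is fixed at 3 above. One common way to sanity-check such a choice is an elbow plot of the total within-cluster sum of squares over several values of k. The sketch below runs on synthetic two-dimensional data with three built-in groups (an assumption for illustration), so it shows the technique rather than a result for the city data.

```r
set.seed(42)

# Three synthetic, well-separated groups in two dimensions (not the city data).
centers <- matrix(c(0, 0,
                    6, 0,
                    3, 6), ncol = 2, byrow = TRUE)
toy <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(50, centers[i, 1]), rnorm(50, centers[i, 2]))
}))

# Total within-cluster sum of squares for k = 1..6.
k_values <- 1:6
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

print(data.frame(k = k_values, wss = round(wss, 1)))
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```

Replacing `toy` with `pca_for_cluster` would give the corresponding curve for the cities; a clear bend at k = 3 would support the choice made here.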
ggplot(pca_scores, aes(PC1, PC2, color = cluster)) +
geom_point(alpha = 0.8, size = 3) +
theme_minimal() +
labs(title = "City Lifestyle Clusters in PCA Space",
color = "Cluster")
### Cluster visualization in PCA Space (3D)
pca_3d <- as.data.frame(pca_res$x[,1:3])
colnames(pca_3d) <- c("PC1", "PC2", "PC3")
pca_3d$cluster <- city$cluster
plot_ly(pca_3d,
x = ~PC1,
y = ~PC2,
z = ~PC3,
color = ~cluster,
colors = c("#1B7F79", "#D95F02", "#7570B3"),
type = "scatter3d",
mode = "markers",
marker = list(size = 5, opacity = 0.85)) %>%
layout(
title = "3D PCA Clustering of City Lifestyles",
scene = list(
xaxis = list(title = "PC1"),
yaxis = list(title = "PC2"),
zaxis = list(title = "PC3")
)
)
cluster_profile <- city |>
group_by(cluster) |>
summarise(across(where(is.numeric),
list(mean = ~mean(.x, na.rm = TRUE),
sd = ~sd(.x, na.rm = TRUE)),
.names = "{.col}_{.fn}"))
cluster_profile
Cluster Interpretation:
Cluster 1 - Affordable but Lower-Quality
Cluster 2 - High-Quality and Wealthy Cities
Cluster 3 - Urban-Pressure Cities
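Labels like these are easier to justify when cluster means are compared on a common scale. A minimal sketch, assuming synthetic data with two features and two clusters (the values are invented and only the column names echo the dataset), expresses each cluster centroid in standard-deviation units:

```r
set.seed(7)

# Two synthetic groups: low income / low rent and high income / high rent.
toy <- data.frame(
  avg_income = c(rnorm(40, 1500, 200), rnorm(40, 4000, 300)),
  avg_rent   = c(rnorm(40, 500, 80),   rnorm(40, 1500, 150))
)
toy$cluster <- factor(
  kmeans(scale(toy[, c("avg_income", "avg_rent")]), centers = 2, nstart = 25)$cluster
)

# Standardize the features, then average them within each cluster.
z <- as.data.frame(scale(toy[, c("avg_income", "avg_rent")]))
centroid_z <- aggregate(z, by = list(cluster = toy$cluster), FUN = mean)
print(centroid_z)
```

Values well above or below zero mark the features that set a cluster apart from the overall average, which is the kind of evidence a label such as "High-Quality and Wealthy" would rest on.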
world <- ne_countries(scale = "medium", returnclass = "sf")
world_cont <- world |>
dplyr::group_by(continent) |>
dplyr::summarise(geometry = sf::st_union(geometry))
region_cluster <- city |>
dplyr::count(country, cluster) |>
dplyr::group_by(country) |>
dplyr::slice_max(n, n = 1, with_ties = FALSE)
region_cluster
map_data <- world_cont |>
dplyr::left_join(region_cluster,
by = c("continent" = "country"))
ggplot(map_data) +
geom_sf(aes(fill = cluster), color = "white") +
scale_fill_brewer(palette = "Set2", na.value = "grey90") +
theme_minimal() +
labs(title = "Dominant City Lifestyle Cluster by World Region",
fill = "Cluster")